This document explores the prosper Loans dataset. The data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. The data set was last updated on 03/11/2014
In this this project we will explore the characteristics of variables that can affect the loan status and to get some ideas about the relationships among multiple variables using summary statistics and data visualizations.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import matplotlib.cm, matplotlib.colors
import seaborn as sb
import geopandas as gpd, folium, requests,fiona, branca, json
import missingno as msno
from wordcloud import WordCloud,STOPWORDS
from PIL import Image
import urllib,requests,random,palettable
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly_express as px
%matplotlib inline
%config InlineBackend.figure_format ='retina'
#from folium import features
#mask = np.array(Image.open(requests.get('http://pluspng.com/img-png/file-usa-map-black-png-2250.png', stream=True).raw))
#wc=loan.Occupation.value_counts().to_dict()
#wordcloud=WordCloud(width = 512, height = 512,background_color='black', stopwords=STOPWORDS,mask=mask).generate_from_frequencies(wc)
#plt.figure(figsize=(8,6),facecolor = 'white', edgecolor='blue')
##plt.tight_layout(pad=0)
#plt.imshow(wordcloud, interpolation='bilinear')
#plt.axis("off")
##plt.tight_layout(pad=0)
#plt.show()
"""There were issues with the image files, encountered a ValuError:bad transparency mask
https://stackoverflow.com/questions/31273592/valueerror-bad-transparency-mask-when-pasting-one-image-onto-another-with-pyt"""
prosper=("#FE7702","#E10C79")
#from palettable.colorbrewer.sequential import PuRd_6
#def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
# return "hsl(0, 0%%, %d%%)" % random.randint(10, 50)
#def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
# return tuple(PuRd_6.colors[random.randint(2,8)])
def prosper_hue(word,font_size,position,orientation,random_state=None, **kwargs):
return np.random.choice(prosper)
#cell block containing various color elements used in notebook
def set_my_style():
sb.set()
hue_palette=sb.color_palette(prosper_hue2)
sb.set_palette(hue_palette)
prosper_hue2 =['#84807A',"#FE7702","#E10C79"]
prosper_hue3 =['#84807A',"#FE7702","#bfff00","#E10C79","#0000ff","#FFC300","#4A235A"]
#used palettable colors because it's more vibrant than the one seaborn has to offer
prosper_hue3_plasma=palettable.matplotlib.Plasma_3.hex_colors
prosper_hue5_plasma=palettable.matplotlib.Plasma_5_r.hex_colors
prosper_hue_medium=palettable.matplotlib.Inferno_8.hex_colors
prosper_hue_medium_plasma=palettable.matplotlib.Plasma_8.hex_colors
prosper_hue_medium_plasma_r=palettable.matplotlib.Plasma_8_r.hex_colors
prosper_hue_11=palettable.matplotlib.Inferno_11.hex_colors
prosper_hue_11_plasma=palettable.matplotlib.Plasma_11.hex_colors
prosper_hue_many=palettable.matplotlib.Inferno_20.hex_colors #for plots with many distinct categories
prosper_hue_many_plasma=palettable.matplotlib.Plasma_11_r.hex_colors
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
#load the loan data into a dataset
loan=pd.read_csv('prosperLoanData.csv')
loan5=loan.copy()
loan.shape
loan.info()
sum(loan.duplicated())
msno.bar(loan,fontsize=14,color='#84807A')
pd.set_option("display.max_columns", len(loan.columns))
loan.sample(5)
change the datatypes of the LoanOriginationDate and the ListingCreationDate to datetime dtypes
print(loan['LoanOriginationDate'].dtype)
print(loan.ListingCreationDate.dtype)
loan['LoanOriginationDate']=pd.to_datetime(loan['LoanOriginationDate'])
loan['ListingCreationDate']=pd.to_datetime(loan['ListingCreationDate'])
assert loan.LoanOriginationDate.dtype == np.dtype('<M8[ns]') and loan.ListingCreationDate.dtype == np.dtype('<M8[ns]')
loan.LoanOriginationDate.min()
I am interested in the Rating scores(both old and new) so I explored the relevant cloumns and formatted them as appropriate
#cant attach .dropna() because CreditGrade and any ProsperScore do not have any intersecting values thus error
loan.loc[:,["ProsperRating (numeric)","ProsperRating (Alpha)","ProsperScore","CreditGrade"]].sample(5)
ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
ProsperRating (Alpha): The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
sorted(loan.ProsperScore.unique())
sorted(loan['ProsperRating (numeric)'].unique())
sorted(loan['ProsperRating (Alpha)'].unique()[1:],reverse=True)
CG=loan['CreditGrade'].unique().tolist()
CG
del CG[1]
sorted(CG,reverse=True)
loan[['ProsperScore',"ProsperRating (numeric)","ProsperRating (Alpha)"]][loan.ProsperScore== 11.0].sample(10)
# "ProsperRating (numeric)","ProsperRating (Alpha)","ProsperScore","CreditGrade" into ordered categorical types
ordinal_var_dict = {'ProsperScore': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
'ProsperRating (numeric)': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
'ProsperRating (Alpha)': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
'CreditGrade':['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']
}
for var in ordinal_var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
loan[var] = loan[var].astype(ordered_var)
loan.describe()
There are 113937 unique entries in this dataset and 81 different variables. Most of the variables are numeric with a few categorical variables. The variables can be grouped into four broad categories:
- Loan Status : The status of the loan list, such as Cancelled, Charged off, Completed, Current, Defaulted, Final Payment In Progress, Past Due.
- Borrower Data : Basic properties about borrowers such as income, occupation, employment status, etc.
- Loan Data : Basic properties about the loan such as length of the loan(term), Borrower APR, etc.
- Credit Risk Metrics : Metrics measured the risk of loans, such as Credit grade, Prosper Score, bank card utilization, etc.
Since this is a lending platform, the features of interest to me are those that helps in answering credit analysis questions. i.e determining an entity’s debt servicing capacity, or its ability to repay.
Credit Analysis is the process of drawing conclusions from available data (both quantitative and qualitative) regarding the credit – worthiness of an entity, and making recommendations regarding the perceived needs, and risks.
I am curious to know what factors affect Loan performance, therefore i will be loking at LoanStatus, Prosper Score and CreditGrade
Technically speaking, the CreditGrade and ProsperScore are supposed to be the major determinants of the loan performance, because the credit rating/score are issued after analysis of other parameters. so firstly lets eplore the variables in our dataset that I feel will have a strong effect on how the loans will turn out, then we will see how they relate with CreditGrade and ProsperScore.
The features I fell will be exploring are ,Occupation, Location, IncomeRange, LoanStatus, LoanOriginalAmount, CurrentDelinquencies, IsBorrowerHomeowner, EmploymentStatus, BankcardUtilization, TotalProsperLoans, DebttoIncomeRatio, BorrowerAPR, BorrowerRate and finally CreditGrade and ProsperScore
Let's Explore Borrowers by Location¶
gdf = gpd.read_file("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json")
statesabbr=pd.read_json("https://gist.githubusercontent.com/mshafrir/2646763/raw/8b0dbb93521f5d6889502305335104218454c2bf/states_titlecase.json")
gdf.columns
statesabbr.columns
gdf1=pd.merge(left=gdf,right=statesabbr,how='left',on=['name'])
gdf1.head()
gdf1.info()
gdf1.rename(columns={'abbreviation':'BorrowerState'},inplace=True)
#gdf1.head()
loan1=loan.copy()
gdf_test_df=gdf1.copy()
loan_test_df=loan1.groupby('BorrowerState',as_index=False)['ListingKey'].count()
loan.BorrowerState.unique()
#loan_test_df
gdf_test_df=pd.merge(left=gdf_test_df,right=loan_test_df,how='left',on=['BorrowerState'])
gdf_test_df.head()
gdf_test_df.crs = fiona.crs.from_epsg(4326)
gdf_test_df[gdf_test_df.name=='District of Columbia']
gdf_test_df=gdf_test_df[gdf_test_df['name']!='District of Columbia']
gdf_test_df.loc[51,'ListingKey']=0
gdf_test_df.info()
gdf_test_df.id.dtype
#gdf_test_df.to_csv('prosper_loan_gdf_map.csv',index=False)
variable = 'ListingKey'
gdf_test_df=gdf_test_df.sort_values(by=variable, ascending=True)
colormap = folium.LinearColormap(colors=["red","orange","yellow","green"],vmin=gdf_test_df.loc[gdf_test_df[variable]>0, variable].min(),
vmax=gdf_test_df.loc[gdf_test_df[variable]>0, variable].max()).to_step(n=5)
centroid=gdf_test_df.geometry.centroid
m=folium.Map(location=[centroid.y.mean(), centroid.x.mean()], zoom_start=4, tiles='StamenWatercolor')
#tooltip=folium.features.Tooltip()
#folium.
folium.GeoJson(gdf_test_df[['geometry','name',variable]],
name="United States of America",
style_function=lambda x: {"weight":2, 'color':'black','fillColor':colormap(x['properties'][variable]), 'fillOpacity':0.2},
highlight_function=lambda x: {'weight':3, 'color':'black'},
smooth_factor=2.0,
#tooltip=tooltip
tooltip=folium.features.GeoJsonTooltip(fields=['name',variable],
aliases=['State','Number of loans by state'],
labels=True,
sticky=True,
)
).add_to(m)
colormap.add_to(m)
folium.LayerControl(autoZIndex=False, collapsed=False).add_to(m)
m
From the map visualisation, we can see that most Prosper borrowers come from California.
I wanted to find out why and I did a little search and it turns out, the company was founded in California, USA in 2005, it's headquarters is also San Francisco.
I thought the situation of their offices was a possible reason but then I checked their other office location: Phoenix, Arizona and there, they had just 1901 Prosper Borrowers so i abandoned that theory.
Lastly I checked U.S.A's population distribution by state and the Prosper Borrower number mirrors that.
loan.Occupation.value_counts().head()
img_link='https://raw.githubusercontent.com/tonyodiba/Prosper_Loans_data_exploration/master/files/Usa-pfadSvgTutorial.svg.png'
img_link2='https://raw.githubusercontent.com/tonyodiba/Prosper_Loans_data_exploration/master/files/326-3267244_prosper-loans-on-twitter-prosper-marke.png'
icon=Image.open(requests.get(img_link2,stream=True).raw).convert("RGBA")
mask = Image.new("RGBA", icon.size, (255,255,255))
mask.paste(icon,icon)
mask = np.array(mask)
wc=loan.Occupation.value_counts().to_dict()
wordcloud=WordCloud(width = 400, height = 400,
background_color='white',max_font_size=300,
random_state=123,stopwords=STOPWORDS,mask=mask).generate_from_frequencies(wc)
#wordcloud.recolor(color_func=color_func, random_state=7)
wordcloud.recolor(color_func=prosper_hue,random_state=53)
plt.figure(figsize=(10,10),facecolor = 'white', edgecolor='blue')
plt.imshow(wordcloud, interpolation='bilinear')
plt.tight_layout(pad=2)
plt.axis("off")
plt.show()
set_my_style()
#sb.set_palette(palette=cpal)
sb.set_context('notebook')
ax=loan.Occupation.value_counts()[19::-1].plot.barh(figsize=(8,6),color=prosper_hue2[0])
plt.title('Top 20 Listed Occupations')
plt.xlabel('count')
plt.ylabel('Occupation')
plt.show()
set_my_style()
sb.set_context('notebook')
#prosper_gradient=palettable.cmocean.sequential.Gray_20.hex_colors
#sb.set_palette(palette=prosper_gradient)
ax=loan.Occupation.value_counts()[2:].plot.bar(figsize=(12,6))
plt.title('Listed Occupations apart from "Others" and "Professional"')
plt.ylabel('count')
plt.xlabel('Occupation')
plt.show()
From the graph, we can see that a majority of people have other Occupations that the ones listed, second most common is Professional which is a broad classification and doesnt tell us much. When we remove two highest Occupation classifications, we see that Computer Programmer is the Highest and it gradually reduces till the last which is Student-Technical School.
Which Income Bracket seems to be borrowing the Most ? what about Employment Status Group?¶
loan.IncomeRange=loan.IncomeRange.astype('category')
plt.figure(figsize = [17, 5])
plt.subplots_adjust(wspace = 0.3)
plt.subplot(121)
set_my_style()
sb.set_context('notebook')
ax=loan.IncomeRange.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[1])
plt.title('Income Rate Distribution')
plt.xlabel('Income Bracket')
plt.ylabel('Count')
plt.subplot(122)
ax=loan.EmploymentStatus.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Employment Status')
#plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.xlabel('Employment Status')
plt.show()
sb.distplot(loan.LoanOriginalAmount,bins=5,color='#FE7702')
plt.title("LoanOriginalAmount showing Distribution")
plt.ylabel('Probability')
plt.show()
binsize=500
bins=np.arange(0,loan.LoanOriginalAmount.max()+binsize,binsize)
#fig,
plt.figure(figsize = [17, 5])
plt.subplots_adjust(wspace = 0.3)
loan_mean=loan.LoanOriginalAmount.mean()
ax=plt.subplot(1,2,1)
sb.distplot(loan.LoanOriginalAmount,bins=bins,kde=False)
plt.axvline(loan_mean,color='b',label='Avg. Loan Amount')
ax.text(loan_mean,-500,"{:.0f}".format(loan_mean),rotation=90,color='b')
plt.title("LoanOriginalAmount showing Count")
plt.xticks(rotation=90)
plt.ylabel('count')
plt.legend()
ax=plt.subplot(1,2,2)
sb.distplot(loan.LoanOriginalAmount,bins=bins,color='#FE7702')
plt.title("LoanOriginalAmount showing Distribution")
plt.xticks(rotation=90)
plt.ylabel('Probability')
plt.show()
We have a right skewed multi-modal distribution for the LoanOriginalAmount. The initial figure showing loanOriginalAmount distribution shows that a lot of the loans are more than \$0 and less than $10,000.
I then increased the binsize and added a line that shows the average mark for the loan amounts and we can see clearly that a lot of the loan amounts are concentrated to the left of the graph.
Also, we can now see that the are several peaks at \$5,000, \$10,000, \$15,000 and also to a lesser degree at \$20,000, and \$25,000.
Interestingly, there's a steep jump in frequency at these amounts rather than a smooth ramp up. This suggests that people are more likely to request for loan amounts that are multiples of \$5,000.
Debt to Income Ratio¶
Loan_hist=loan.copy()
Loan_hist.dropna(subset=['DebtToIncomeRatio'],inplace=True)
#loan.DebtToIncomeRatio
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.distplot(Loan_hist['DebtToIncomeRatio'],kde=False,
color='#FE7702')
ax.set_ylabel('Count')
ax.set_title('Debt to Income Ratio')
plt.subplot(122)
ax=sb.distplot(Loan_hist['DebtToIncomeRatio'],kde=False,
color='#FE7702')
ax.set_ylabel('Count')
ax.set_xscale('log')
ax.set_xlabel('DebtToIncomeRatio logScale')
ax.set_title('Debt to Income Ratio')
plt.show()
Most of the borrowers have a debt-to-income ratio below 1.0, and the tail gets really small past that number. I limited the x-axis so we can gain more insight from our plot; The new plot shows an outlier value near 10.
Current Delinquencies¶
loan_delinquencies=loan.copy()
loan_delinquencies.dropna(subset=['CurrentDelinquencies'],inplace=True)
#loan_delinquencies.CurrentDelinquencies
plt.figure(figsize=(17,4.4))
plt.subplots_adjust(wspace = 0.3)
plt.subplot(131)
ax=sb.distplot(loan_delinquencies['CurrentDelinquencies'],kde=False,hist=True)
ax.set_ylabel('Count')
ax.set_title('Current Delinquencies')
plt.subplot(132)
ax=sb.distplot(loan_delinquencies['CurrentDelinquencies'],kde=False,hist=True)
ax.set_xscale('log')
ax.set_xlabel('CurrentDelinquenciies logScale')
ax.set_ylabel('Count')
ax.set_title('Current Delinquencies')
plt.subplot(133)
ax=sb.distplot(loan_delinquencies['CurrentDelinquencies'],kde=False)
ax.set_xlim(0,10)
ax.set_xlabel('CurrentDelinquenciies llimited to 10')
ax.set_ylabel('Count')
ax.set_title('Current Delinquencies(x-axis limited to 0>x<10)')
plt.show()
Here most borrowers have 0 current delinquencies. I’ve limited the x-axis to 20 so that we can see the abrupt decline in current delinquencies.
Let's take a look at loan status counts¶
cpal=palettable.matplotlib.Inferno_8.hex_colors
loan.LoanStatus.value_counts()
I classified the LoanStatus into Performing and NonPerforming Categories.
The performing Category consists of;
- The 'Completed' class
- The 'Current' class (since the haven't defaulted)
- The 'FinalPaymentInProgress' class
#classified the LoanStatus into performing and NonPerforming.
loan['Loan_stat1']=(loan.LoanStatus.isin(['Completed','Current','FinalPaymentInProgress'])).astype('category')
loan.Loan_stat1.value_counts()
#loan_stat.Loan_stat=loan_stat.Loan_stat.astype('category')
loan.Loan_stat1.dtype
#rename the newly created loan_stat category column
loan.Loan_stat1.cat.rename_categories({False:'NonPerforming',True:'Performing'},inplace=True)
loan.Loan_stat1.value_counts()
#tried this initially, didnt't work
#loan_stat.Loan_stat.cat.set_categories(['Performing','NonPerforming'],inplace=True)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#set_my_style()
plt.subplots(figsize=[16,5])
plt.subplots_adjust(wspace=0.2)
sb.set_context('notebook')
plt.subplot(121)
ax=loan.LoanStatus.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Whats the status of the loans?')
plt.xlabel('Loan Status')
plt.ylabel('Number of Listings')
plt.subplot(122)
ax=sb.countplot(x='LoanStatus',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
order=loan.LoanStatus.value_counts().index,
dodge=False,
palette=['#FE7702', '#E10C79'])
ax.legend(loc='upper right',prop={'size':12})
plt.xticks(rotation=90)
ax.set_yscale('log')
ax.set_ylabel('Number of Listings')
#hue_order=np.sort(loan.Loan_stat.unique())
plt.show()
We can see tha a lot of the loans are current and so far, have no issues. It was a bit difficult to compare the other loan status so I used a log scale for the other plot and seperated the Performing from the NonPerforming loans to give us a clearer picture of the loan status.
Lets define HighRisk loans as be loans that are PastDue, Chargeoff or Defaulted; Completed loans are loans in Completed, FinalPaymentInProgress and Cancelled.
'IsBorrowerHomeowner'
ax=loan.IsBorrowerHomeowner.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Is Borrower A Home Owner')
#plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.show()
loan.IsBorrowerHomeowner.value_counts()
From the above plot we can tell that the ratio of home owners to those who do not own homes is almost equal, we did a quick check to verify.
This is the Borrower's Annual Percentage Rate (APR) for the loan. Let's see what we can glean from plots of this data.
loan_BorrowerAPR=loan.copy()
loan_BorrowerAPR.dropna(subset=['BorrowerAPR'],inplace=True)
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.distplot(loan_BorrowerAPR.BorrowerAPR,kde=False,hist=True)
ax.set_ylabel('Count')
ax.set_title('Borrower Annual Payment Rate')
plt.subplot(122)
ax=sb.distplot(loan_BorrowerAPR.BorrowerAPR,kde=False,hist=True,bins=300)
ax.set_ylabel('Count')
ax.set_title('Borrower Annual Payment Rate')
plt.show()
loan_BorrowerAPR.BorrowerAPR.mode()
loan_BorrowerAPR.BorrowerAPR.mean()
loan_BorrowerAPR[loan_BorrowerAPR.BorrowerAPR ==(0.35356 ].sample(5)
A lot of the BorrowerAPR occurs at 0.35797 i tried to take a look manually to see if i could determine why but i was unsucessful, I will explore this further later.
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.countplot(x='ProsperScore',data=loan,alpha=0.9,palette=[prosper_hue2[0]])
plt
plt.subplot(122)
ax=sb.countplot(x='CreditGrade',data=loan,alpha=0.9,
palette=[prosper_hue2[0]])
‘CreditGrade’ is the pre-2009 rating, and ‘ProsperRating’ is the more sophisticated post-2009 rating. Also, ‘CreditGrade’ has around 29,000 values and ‘ProsperRating’ has a little more than 85,000 values. Incomparison to overall count, there are less High rating scores post 2009 than there were in pre-2009. maybe it is because after the company restructured, they implemented a stricter rating system.
In this section, I renamed the values in some columns hereby putting some of them into broader classifications, I then converted these columns values to categorical values.
#change LoanStatus to category type
loan.LoanStatus=loan.LoanStatus.astype('category')
#make new loanStatus columns
loan['Loan_stat2']=loan['LoanStatus']
loan['Loan_risk']=loan['LoanStatus']
#loan.Loan_stat2.dtype
#loan.Loan_risk.dtype
loan.LoanStatus.value_counts()
clean_Loan_stat_risk={'Loan_stat2':{'FinalPaymentInProgress':'Completed','Cancelled': 'Completed',
'Past Due (1-15 days)': 'PastDue','Past Due (16-30 days)':'PastDue',
'Past Due (31-60 days)':'PastDue','Past Due (61-90 days)':'PastDue',
'Past Due (91-120 days)':'PastDue','Past Due (>120 days)':'PastDue'},
'Loan_risk':{'FinalPaymentInProgress':'Completed','Cancelled': 'Completed',
'Past Due (1-15 days)': 'HighRisk','Past Due (16-30 days)':'HighRisk',
'Past Due (31-60 days)':'HighRisk','Past Due (61-90 days)':'HighRisk',
'Past Due (91-120 days)':'HighRisk','Past Due (>120 days)':'HighRisk',
'Defaulted':'HighRisk','Chargedoff':'HighRisk'}}
#use replace to change categorys values
loan.replace(clean_Loan_stat_risk,inplace=True)
loan.Loan_stat2.value_counts()
loan.Loan_risk.value_counts()
loan.Loan_stat2=loan.Loan_stat2.astype('category')
loan.Loan_risk=loan.Loan_risk.astype('category')
loan.Loan_stat2.dtype
loan.Loan_risk.dtype
#loan.LoanOriginationQuarter.value_counts()
#I would have loved to convert this variable in order to explore it but i'm out of time
loan.LoanOriginationQuarter.dtype
CA has the highest volume of loans and almost double the next following states TX and NY.
The BorrowerAPR is a left-skewed normal distribution and it has a spike at ~0.358. 'DebtToIncomeRatio' as well as 'CurrentDelinquencies' both had a large range of values, so i set the x-axis to a log scale and it shows that most of their values were at 1 and below 10 respectively. i.e both are right-skewed.
We have a right skewed multi-modal disistribution for the LoanOriginalAmount. The initial figure showing loanOriginalAmount distribution shows that a lot of the loans are more than \$0 and \$10,000. I then increased the binsize and added a line that shows the average mark for the loan amounts and we can see clearly that a lot of the loan amounts are concenterated to the left of the graph. Also, we can now see that the are several peaks at \$5,000, \$10,000, \$15,000 and also to a lesser degree at \$20,000, and \$25,000. Interestingly, there's a steep jump in frequency at these amounts rather than a smooth ramp up. This suggests that people are more likely to request for loan amounts that are multiples of \$5,000.
For LoanStatus, for future visualizations, I changed all the different past due values and i changed the column dtype to categorical.
#Remove outstanding loans
loan_historical = loan[loan["Loan_risk"] != "Current"]
#df_historical["LoanStatus"].value_counts()
plt.figure(figsize=(15,5))
plt.subplot(121)
ax=sb.stripplot(data=loan,x='Loan_risk',y='LoanOriginalAmount',
edgecolor='W',linewidth=0.5,size=8,palette=prosper_hue3_plasma,
jitter=True,dodge=True)
ax.set_xlabel('Loan Status')
ax.set_ylabel(' Original Loan Amount')
ax.set_title('Loan Amount vs RIsk')
plt.subplot(122)
ax=sb.stripplot(data=loan,palette=prosper_hue_11_plasma,x='ProsperScore',y='LoanOriginalAmount',
edgecolor='W',linewidth=0.4,size=8,
jitter=True,dodge=True)
ax.set_xlabel('Prosper Score')
ax.set_ylabel('Original Loan Amount')
ax.set_title('Loan Amount vs ProsperScore')
plt.show()
#from palettable.colorbrewer.sequential import Blues_8
sb.set_context('notebook')
plt.figure(figsize=(18,6))
ax=sb.countplot(x='LoanStatus',data=loan,hue='IncomeRange',alpha=0.9,palette=prosper_hue_medium_plasma)
ax.legend(loc='upper right',prop={'size':12})
plt.xticks(rotation=90)
ax.set_yscale('log')
plt.title('LoanStatus Vs IncomeRange')
plt.show()
#from palettable.colorbrewer.sequential import Blues_8
sb.set_context('notebook')
plt.figure(figsize=(18,6))
ax=sb.countplot(x='LoanStatus',data=loan,hue='ProsperScore',alpha=0.9,palette=prosper_hue_11_plasma)
ax.legend(loc='upper right',prop={'size':12})
plt.xticks(rotation=90)
ax.set_yscale('log')
plt.title('LoanStatus Vs ProsperScore')
plt.show()
#loan['ProsperRating (numeric)']=loan['ProsperRating (numeric)'].astype(int)
#loan_test_df.sort_values(by='ListingKey')
#created a df for loan AailableBankCreditCard with out the outlier values
loan_ABCC_less_90=loan[loan.AvailableBankcardCredit<90000]
#plotted regression for BorrowerRate vs DebtTpIncomeRatio and BorrowerRate Vs AvailableBankCreditCard
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.regplot(data=loan,x='DebtToIncomeRatio',y='BorrowerAPR')
plt.subplot(122)
ax=sb.regplot(data=loan_ABCC_less_90,y='BorrowerAPR',x='AvailableBankcardCredit')
#ax.set_ylim(0,0.4)
plt.show()
binsize=500
bins=np.arange(0,loan.LoanOriginalAmount.max()+binsize,binsize)
#fig,
plt.figure(figsize = [17, 5])
plt.subplots_adjust(wspace = 0.3)
sb.set_palette(prosper_hue2)
plt.subplot(121)
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='CA'],kde=False,label='Califonia')
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='TX'],kde=False,label='Texas')
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='NY'],kde=False,label='NewYork')
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='FL'],kde=False,label='Florida')
ax.legend()
plt.title("Origial Loan Amount for Top 4 states")
#plt.xticks(rotation=90)
plt.ylabel('count')
#ax.set_yscale('log')
#sb.distplot(loan.LoanOriginalAmount,bins=bins,color='#FE7702')
plt.subplot(122)
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='ND'],shade=True,label='North Dakota')
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='ME'],shade=True,label='Maine')
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='WY'],shade=True,label='Wyoming')
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='IA'],shade=True,label='Iowa')
ax.legend()
plt.xlabel('Original Loan Amount')
plt.title("Origial Loan Amount for bottom 4 states")
plt.show()
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.countplot(x='ProsperScore',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
palette=['#FE7702', '#E10C79'])
plt.subplot(122)
ax=sb.countplot(x='CreditGrade',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
palette=['#FE7702', '#E10C79'])
Graph of Prosper Score vs LoanStatus shows that it has a left-skewed distribution for completed loan, which means completed loans have primarily good ratings. However, Prosper score distributed a bell-shaped in high risk loan.
This is particularly observed in past-Due (32-60 days). Compared to Prosper Rating with a left-skewed shape, it seems like Prosper Score presents a less ability to detect the high risk loans.
From the plot of both ProsperScore and CcreditGrade vs LoanStatus(i.e Performing and NonPerforming Loans), I observed that Prosper Score was wayy more efficient at determining the performance of a loan than creditGrade was. This indicates that the risk management system at Prosper is efficient.
BorrowerAPR seems to be positively correlated with DebtToIncomeRatio and negatively correlated with AvailableBankCardCredit. It seems that a higher available bank card credit will lower your interest rate. I theorize that when a borrower has relatively low credit amount the probabilities of getting low and high interest rates are similar, but with a high available bank credit, an individual is more likely to get a lower interest rate.
# boxplot of 'Loan Amount vs Income Range vs LoanStatus'
plt.figure(figsize=(16,7))
ax=sb.boxplot(data=loan,hue=loan.Loan_stat1,x=loan.IncomeRange,y=loan.LoanOriginalAmount,dodge=True)
plt.xticks(rotation=90)
ax.set_title('Loan Amount vs Income Range vs LoanStatus')
plt.show()
#plt.figure(figsize=(22,22))
g = sb.PairGrid(loan, vars=['TotalProsperLoans', 'LoanOriginalAmount', 'AmountDelinquent', 'DebtToIncomeRatio'],
hue='EmploymentStatus', palette=prosper_hue_medium_plasma)
g.map(plt.scatter, alpha=0.8)
g.add_legend();
plt.show()
#plt.figure(figsize=(15,7))
sb.set_context('notebook')
sb.lmplot(data=loan,size=7,aspect=2,x='DebtToIncomeRatio',y='BorrowerAPR',hue='CreditGrade',
fit_reg=False,palette=prosper_hue_medium_plasma_r)
plt.show()
init_notebook_mode(connected=True)
loan_no_nan=loan.copy()
loan_no_nan2=loan.copy()
loan_no_nan.dropna(subset=['ProsperScore'],inplace=True)
loan_no_nan2.dropna(subset=['CreditGrade'],inplace=True)
px.scatter(loan_no_nan2,trendline='ols',x='LoanOriginalAmount',
y='AmountDelinquent',color='CreditGrade',
log_x=False,facet_col='Loan_risk',
log_y=True)
px.scatter(loan_no_nan,trendline='ols',trendline_color_override='Black',x='LoanOriginalAmount',
y='StatedMonthlyIncome',color='ProsperScore',log_x=True,facet_col='Loan_risk',
log_y=True)
loan_Borrower=loan.copy()
loan_Borrower2=loan.copy()
loan_Borrower.dropna(subset=['CreditGrade'],inplace=True)
loan_Borrower2.dropna(subset=['ProsperScore'],inplace=True)
px.box(loan_no_nan,x='ProsperScore',color='IsBorrowerHomeowner',
y='BorrowerAPR',log_x=False,facet_col='Loan_stat1',
log_y=False)
loan_parallel=loan.copy()
loan_parallel['Par_col']=loan_parallel['Loan_risk']
loan_parallel['Par_col2']=loan_parallel['Loan_stat2']
clean_Loan_risk={'Par_col':{'Completed':1,'Current': 2,'HighRisk':3},
'Par_col2':{'Completed':1,'Current':2, 'PastDue':3, 'Defaulted':4, 'Chargedoff':5}}
loan_parallel.replace(clean_Loan_risk,inplace=True)
loan_parallel.Par_col.dtype
#loan_parallel_no_na=loan_parallel.dropna(subset=['ProsperScore'])
#loan_parallel_no_na.info()
init_notebook_mode(connected=True)
df=loan_parallel
data = [
go.Parcoords(
line = dict(color = df['Par_col'],
colorscale = [[0,'#84807A'],[0.5,"#FE7702"],[1,'#E10C79']]),
#line = dict(color = df['colorVal'],
# colorscale = 'Jet',
# showscale = True,
# reversescale = True,
# cmin = -4000,
# cmax = -100),
dimensions = list([
dict(range = [100,35000],
constraintrange=[20000,25000],
label = 'Original Loan Amount', values = df['LoanOriginalAmount']),
dict(tickvals = [1,2,3,4,5],
constraintrange=[4,5],
ticktext = ['Completed', 'Current', 'PastDue', 'Defaulted', 'Chargedoff'],
label = 'Loan Status', values = df['Par_col2']),
dict(range=[0.0,10.01],
label = 'DebtToIncomeRatio', values = df['DebtToIncomeRatio']),
dict(range = [0.0,175500],#monthly income range too high 0.0-1755000 reduce it
visible = True,
label = 'Monthly Income(Stated)', values = df['StatedMonthlyIncome']),
dict(range = [0,40],
label = 'Recommendations', values = df['Recommendations']),
dict(range = [1,1189],
label = 'Investors', values = df['Investors']),
dict(range = [0,8],
label = 'TotalProsperLoans', values = df['TotalProsperLoans']),
dict(range = [-1,12],
constraintrange=[9,11],
tickvals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
label = 'Prosper Score', values = df['ProsperScore'])
#[0.0,646285.0] also add incomerange
])
)
]
layout = go.Layout(
title='Parallel plot showing interaction of 9 different variables',
plot_bgcolor = '#E5E5E5',
paper_bgcolor = '#E5E5E5',
)
fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig,filename='Prosper_loans_parallel')
#plotly.offline.plot(fig, filename = 'Proser_loans.html')
#Occupation_group.dropna(subset=['Prosper_score_clean'],inplace=True)
From the BoxPlot of the LoanOriginal amount vs Income Range, when grouped by performing vs Non-Performing loans, we observe that the two categories where non-performing loans is higher than performing loans are in the \$0 income range and the non employed.
Borrower ARP decreases with improving Prosper Score. For performing loans, it was observed that prosperScore of >10 enjoy the least BorrowerAPR, with a relatively short range of about 0.05-0.14, and it seems Home owners enjoy an even smaller BorrowerARP compared to non-home owners.
For the non-performing loan group, from a prosperScore of 0-5, the borrower Annual Percentage Rate is about the same with variation in range, however from the rating score of >6, it decreases with improving score. Also here, the BorrowerAPR for home owners appears to be lower than for non-home owners.
From the plot of StatedMonthlyIncome vs LoanOriginalAmount, we can see that the ordinary leeast square regression line shows an upward trend, meaning that as statedMonthly income increased, the loan amoint increased. this trend is more apparent in the prosperScore group of 10.0
For BorrowerAPR vs DebttoIncomeRatio, firstly, most people tend to have debt-to-income ratios below 1, regardless of risk category. Iobserved alos that AA category loans seem to have a lower BorrowerAPRs and a smaller range of debt-to-income ratios, both of which indicate less risk and APR increases as the rating gets riskier. Most people tend to have debt-to-income ratios below 1, regardless of risk category.
From the 'Parallel plot showing interaction of 9 different variables' I observed that they were a lot of loans which had a prosperScore of 9,10 and 11 that were either chargedoff or defaulted,some of these loans with original loan amount between \$20,000 and \$25,000 had monthlyIncome Amounts close to the borrowed amounts.